Introduction to Gadfly


Gadfly is a library for plotting and visualization written in Julia. It is based largely on Hadley Wickham's ggplot2 for R and Leland Wilkinson's book The Grammar of Graphics. A similar package in Python is plotnine (https://realpython.com/ggplot-python/).

Some of its features are:

  • Renders publication quality graphics to SVG, PNG, Postscript, and PDF
  • Intuitive and consistent plotting interface
  • Works with Jupyter notebooks via IJulia out of the box
  • Tight integration with DataFrames.jl
  • Interactivity like panning, zooming, toggling powered by Snap.svg
  • Supports a large number of common plot types

Additional Recommended Resources:

In [1]:
using CSV
using DataFrames
using Statistics
using FreqTables
using StatsBase
using DataFramesMeta
using Gadfly
using NamedArrays ##For named arrays
In [2]:
ENV["COLUMNS"] = 1000
ENV["LINES"] = 20
Out[2]:
20

Case : Cars Data

This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)

Attribute Information:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete (1: American, 2: European, 3: Japanese)
  9. car name: string (unique for each instance)

More info: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

Read the data

Read the data from the url source (fixed-width formatted lines)

In [3]:
homedir()
Out[3]:
"/Users/Rahul"
In [4]:
pwd()
Out[4]:
"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"
In [5]:
#autos_df = pd.read_csv('./data/autos_df.csv',
                       #index_col=['car_name'])
#autos_df.head()
In [6]:
autos_df = CSV.read("./data/autos_df.csv", DataFrame)
Out[6]:

398 rows × 9 columns

Row  car_name                      mpg      cylinders  displacement  horsepower  weight   acceleration  model_year  origin
     String                        Float64  Int64      Float64       Float64?    Float64  Float64       Int64       Int64
1    chevrolet chevelle malibu     18.0     8          307.0         130.0       3504.0   12.0          70          1
2    buick skylark 320             15.0     8          350.0         165.0       3693.0   11.5          70          1
3    plymouth satellite            18.0     8          318.0         150.0       3436.0   11.0          70          1
4    amc rebel sst                 16.0     8          304.0         150.0       3433.0   12.0          70          1
5    ford torino                   17.0     8          302.0         140.0       3449.0   10.5          70          1
6    ford galaxie 500              15.0     8          429.0         198.0       4341.0   10.0          70          1
7    chevrolet impala              14.0     8          454.0         220.0       4354.0   9.0           70          1
8    plymouth fury iii             14.0     8          440.0         215.0       4312.0   8.5           70          1
9    pontiac catalina              14.0     8          455.0         225.0       4425.0   10.0          70          1
10   amc ambassador dpl            15.0     8          390.0         190.0       3850.0   8.5           70          1
11   dodge challenger se           15.0     8          383.0         170.0       3563.0   10.0          70          1
12   plymouth 'cuda 340            14.0     8          340.0         160.0       3609.0   8.0           70          1
13   chevrolet monte carlo         15.0     8          400.0         150.0       3761.0   9.5           70          1
14   buick estate wagon (sw)       14.0     8          455.0         225.0       3086.0   10.0          70          1
15   toyota corona mark ii         24.0     4          113.0         95.0        2372.0   15.0          70          3
16   plymouth duster               22.0     6          198.0         95.0        2833.0   15.5          70          1
17   amc hornet                    18.0     6          199.0         97.0        2774.0   15.5          70          1
18   ford maverick                 21.0     6          200.0         85.0        2587.0   16.0          70          1
19   datsun pl510                  27.0     4          97.0          88.0        2130.0   14.5          70          3
20   volkswagen 1131 deluxe sedan  26.0     4          97.0          46.0        1835.0   20.5          70          2
In [7]:
#autos_df.info()
eltype.(eachcol(autos_df))  # eltypes(autos_df) on older DataFrames.jl releases
Out[7]:
9-element Array{Type,1}:
 String
 Float64
 Int64
 Float64
 Union{Missing, Float64}
 Float64
 Float64
 Int64
 Int64

Typecast

  • convert converts from one Julia type to another, for things that “behave the same”. Float32 and Int64 are both numbers, for the most part, whereas a String and an Int64 are different things entirely.
In [8]:
autos_df[:,:origin] = convert.(Float64, autos_df[:,:origin])
Out[8]:
398-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 2.0
 1.0
 1.0
 1.0
In [9]:
autos_df[:,:origin] = convert.(Int64, autos_df[:,:origin])
Out[9]:
398-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 2
 1
 1
 1
  • string turns things into strings
In [10]:
autos_df[:, :s_origin] = string.(autos_df[:,:origin])
Out[10]:
398-element Array{String,1}:
 "1"
 "1"
 "1"
 "1"
 "1"
 "1"
 "1"
 "1"
 ⋮
 "1"
 "1"
 "1"
 "2"
 "1"
 "1"
 "1"
  • parse turns a string into a value of the given Julia type. It works only on strings, not on other types.
In [11]:
autos_df[:, :origin] = parse.(Int64, autos_df[:,:s_origin]);
In [12]:
autos_df[!, :origin] = parse.(Float64, autos_df[:,:s_origin])
Out[12]:
398-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 2.0
 1.0
 1.0
 1.0
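
The three functions above can be compared on single values. The following is a minimal sketch using only Base Julia; the variable names are illustrative and not part of the notebook. It also shows tryparse, a Base function that returns nothing instead of throwing when the text cannot be parsed.

```julia
# convert: between types that can represent the same value
x = convert(Float64, 3)        # 3.0
y = convert(Int64, 2.0)        # 2 (exact, so allowed)

# convert refuses lossy conversions: 2.5 has no exact Int64 representation
lossy_ok = try
    convert(Int64, 2.5)
    true
catch err
    err isa InexactError ? false : rethrow()
end

# string: a value to its text form
s = string(3)                  # "3"

# parse: text to a value of the requested type
n = parse(Int64, "3")          # 3
f = parse(Float64, "3")        # 3.0

# tryparse returns nothing instead of throwing on bad input
bad = tryparse(Int64, "three")
```

This is why the cells above go through s_origin: parse needs a string column as input, while convert works directly between the numeric types.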
In [13]:
first(autos_df, 2)  # head(autos_df, 2) on older DataFrames.jl releases
Out[13]:

2 rows × 10 columns

Row  car_name                      mpg      cylinders  displacement  horsepower  weight   acceleration  model_year  origin   s_origin
     String                        Float64  Int64      Float64       Float64?    Float64  Float64       Int64       Float64  String
1    chevrolet chevelle malibu     18.0     8          307.0         130.0       3504.0   12.0          70          1.0      1
2    buick skylark 320             15.0     8          350.0         165.0       3693.0   11.5          70          1.0      1

Describe the features

Get the mean, median, minimum, maximum, etc.

In [14]:
#autos_df.describe()

describe(autos_df)
Out[14]:

10 rows × 8 columns

Row  variable      mean     min                      median  max               nunique  nmissing  eltype
     Symbol        Union…   Any                      Union…  Any               Union…   Union…    Type
1    car_name               amc ambassador brougham          vw rabbit custom  305                String
2    mpg           23.5146  9.0                      23.0    46.6                                 Float64
3    cylinders     5.45477  3                        4.0     8                                    Int64
4    displacement  193.426  68.0                     148.5   455.0                                Float64
5    horsepower    104.469  46.0                     93.5    230.0                      6         Union{Missing, Float64}
6    weight        2970.42  1613.0                   2803.5  5140.0                               Float64
7    acceleration  15.5681  8.0                      15.5    24.8                                 Float64
8    model_year    76.0101  70                       76.0    82                                   Int64
9    origin        1.57286  1.0                      1.0     3.0                                  Float64
10   s_origin               1                                3                 3                  String

horsepower has missing values: describe reports nmissing = 6 for it.

In [15]:
#autos_df["horsepower"].isnull().values.any()
#autos_df[autos_df.horsepower.isnull()]

any(ismissing.(autos_df.horsepower))
Out[15]:
true
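
Before dropping rows, it can help to see how missing values behave in ordinary computations. The sketch below uses a small hypothetical vector (hp is not the notebook's column) and only Base Julia plus the Statistics standard library.

```julia
using Statistics

# Hypothetical horsepower values with two gaps (not taken from the dataset)
hp = [130.0, 165.0, missing, 150.0, missing, 140.0]

n_missing  = count(ismissing, hp)      # number of missing entries
naive_mean = mean(hp)                  # missing propagates through arithmetic
clean_mean = mean(skipmissing(hp))     # mean over the observed values only
```

skipmissing leaves the data untouched and only skips the gaps during the computation, whereas dropmissing, used in the next cell, removes the affected rows from the DataFrame.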
In [16]:
#autos_df = autos_df.dropna()

autos_df = dropmissing(autos_df,:)
Out[16]:

392 rows × 10 columns

Row  car_name                      mpg      cylinders  displacement  horsepower  weight   acceleration  model_year  origin   s_origin
     String                        Float64  Int64      Float64       Float64     Float64  Float64       Int64       Float64  String
1    chevrolet chevelle malibu     18.0     8          307.0         130.0       3504.0   12.0          70          1.0      1
2    buick skylark 320             15.0     8          350.0         165.0       3693.0   11.5          70          1.0      1
3    plymouth satellite            18.0     8          318.0         150.0       3436.0   11.0          70          1.0      1
4    amc rebel sst                 16.0     8          304.0         150.0       3433.0   12.0          70          1.0      1
5    ford torino                   17.0     8          302.0         140.0       3449.0   10.5          70          1.0      1
6    ford galaxie 500              15.0     8          429.0         198.0       4341.0   10.0          70          1.0      1
7    chevrolet impala              14.0     8          454.0         220.0       4354.0   9.0           70          1.0      1
8    plymouth fury iii             14.0     8          440.0         215.0       4312.0   8.5           70          1.0      1
9    pontiac catalina              14.0     8          455.0         225.0       4425.0   10.0          70          1.0      1
10   amc ambassador dpl            15.0     8          390.0         190.0       3850.0   8.5           70          1.0      1
11   dodge challenger se           15.0     8          383.0         170.0       3563.0   10.0          70          1.0      1
12   plymouth 'cuda 340            14.0     8          340.0         160.0       3609.0   8.0           70          1.0      1
13   chevrolet monte carlo         15.0     8          400.0         150.0       3761.0   9.5           70          1.0      1
14   buick estate wagon (sw)       14.0     8          455.0         225.0       3086.0   10.0          70          1.0      1
15   toyota corona mark ii         24.0     4          113.0         95.0        2372.0   15.0          70          3.0      3
16   plymouth duster               22.0     6          198.0         95.0        2833.0   15.5          70          1.0      1
17   amc hornet                    18.0     6          199.0         97.0        2774.0   15.5          70          1.0      1
18   ford maverick                 21.0     6          200.0         85.0        2587.0   16.0          70          1.0      1
19   datsun pl510                  27.0     4          97.0          88.0        2130.0   14.5          70          3.0      3
20   volkswagen 1131 deluxe sedan  26.0     4          97.0          46.0        1835.0   20.5          70          2.0      2
In [17]:
#autos_df[["mpg", "displacement","horsepower","weight","acceleration"]].describe()

describe(autos_df[:,["mpg", "displacement","horsepower","weight","acceleration"]])
Out[17]:

5 rows × 8 columns

Row  variable      mean     min      median   max      nunique  nmissing  eltype
     Symbol        Float64  Float64  Float64  Float64  Nothing  Nothing   DataType
1    mpg           23.4459  9.0      22.75    46.6                        Float64
2    displacement  194.412  68.0     151.0    455.0                       Float64
3    horsepower    104.469  46.0     93.5     230.0                       Float64
4    weight        2977.58  1613.0   2803.5   5140.0                      Float64
5    acceleration  15.5413  8.0      15.5     24.8                        Float64

What is the average miles per gallon for cars of each origin?

In [18]:
#autos_df.groupby('origin')['mpg'].mean()

combine(groupby(autos_df,:s_origin), :mpg .=> [mean])
Out[18]:

3 rows × 2 columns

Row  s_origin  mpg_mean
     String    Float64
1    1         20.0335
2    3         30.4506
3    2         27.6029

Plot

Using Gadfly for different plots.

Distribution Plot

Histogram plot

A histogram is a tool for visualizing one-dimensional data that is continuous in nature. Given a set of observations of a single variable:

  • Choose an interval width (the bins into which the entire dataset is bucketed)
  • Count the data points within each bin (the y-axis represents the frequency count)

The plot function with Geom.histogram from the Gadfly package renders a histogram.
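
The two steps can be sketched by hand in Base Julia. This is only an illustration of equal-width binning: bin_counts and mpg_sample are hypothetical names, the values are made up, and Gadfly's own bin selection may differ.

```julia
# Count how many values fall into each of nbins equal-width bins
function bin_counts(data, nbins)
    lo, hi = extrema(data)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in data
        # Bin index for v; the maximum value is clamped into the last bin
        i = min(floor(Int, (v - lo) / width) + 1, nbins)
        counts[i] += 1
    end
    return counts
end

# Hypothetical sample values, not taken from the dataset
mpg_sample = [9.0, 14.0, 15.0, 18.0, 18.0, 22.0, 24.0, 27.0, 31.0, 46.6]

counts = bin_counts(mpg_sample, 4)   # [5, 3, 1, 1]
```

Each entry of counts is the height of one bar. Changing the number of bins, as bincount does in the cells below, changes how coarse or fine the picture looks.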

  • Plot the distribution of miles per gallon for cars with American origin and Japanese origin
In [19]:
autos_df[autos_df.s_origin .== "1",:]
Out[19]:

245 rows × 10 columns

Row  car_name                      mpg      cylinders  displacement  horsepower  weight   acceleration  model_year  origin   s_origin
     String                        Float64  Int64      Float64       Float64     Float64  Float64       Int64       Float64  String
1    chevrolet chevelle malibu     18.0     8          307.0         130.0       3504.0   12.0          70          1.0      1
2    buick skylark 320             15.0     8          350.0         165.0       3693.0   11.5          70          1.0      1
3    plymouth satellite            18.0     8          318.0         150.0       3436.0   11.0          70          1.0      1
4    amc rebel sst                 16.0     8          304.0         150.0       3433.0   12.0          70          1.0      1
5    ford torino                   17.0     8          302.0         140.0       3449.0   10.5          70          1.0      1
6    ford galaxie 500              15.0     8          429.0         198.0       4341.0   10.0          70          1.0      1
7    chevrolet impala              14.0     8          454.0         220.0       4354.0   9.0           70          1.0      1
8    plymouth fury iii             14.0     8          440.0         215.0       4312.0   8.5           70          1.0      1
9    pontiac catalina              14.0     8          455.0         225.0       4425.0   10.0          70          1.0      1
10   amc ambassador dpl            15.0     8          390.0         190.0       3850.0   8.5           70          1.0      1
11   dodge challenger se           15.0     8          383.0         170.0       3563.0   10.0          70          1.0      1
12   plymouth 'cuda 340            14.0     8          340.0         160.0       3609.0   8.0           70          1.0      1
13   chevrolet monte carlo         15.0     8          400.0         150.0       3761.0   9.5           70          1.0      1
14   buick estate wagon (sw)       14.0     8          455.0         225.0       3086.0   10.0          70          1.0      1
15   plymouth duster               22.0     6          198.0         95.0        2833.0   15.5          70          1.0      1
16   amc hornet                    18.0     6          199.0         97.0        2774.0   15.5          70          1.0      1
17   ford maverick                 21.0     6          200.0         85.0        2587.0   16.0          70          1.0      1
18   amc gremlin                   21.0     6          199.0         90.0        2648.0   15.0          70          1.0      1
19   ford f250                     10.0     8          360.0         215.0       4615.0   14.0          70          1.0      1
20   chevy c20                     10.0     8          307.0         200.0       4376.0   15.0          70          1.0      1
In [20]:
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'], label = 'American')
#plt.legend()
In [21]:
set_default_plot_size(24cm, 12cm)
In [22]:
origin_filter = autos_df[autos_df.s_origin .== "1",:]

hp1 = plot(origin_filter, 
            x="mpg", 
            Geom.histogram, 
    
            Guide.title("Cars of American Origin - 1"))
Out[22]:
[Plot: histogram of mpg, titled "Cars of American Origin - 1"]
In [23]:
hp2 = plot(origin_filter, 
            x="mpg", 
            Geom.histogram(bincount=30), 
            
            Guide.title("Cars of American origin - 2"))
Out[23]:
[Plot: histogram of mpg with 30 bins, titled "Cars of American origin - 2"]

Plot the distribution of miles per gallon for cars of all origins

In [24]:
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'],label = 'American', hist=True)
#sn.distplot(autos_df[autos_df.origin == 2]['mpg'], label = 'European',hist = True)
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'], label = 'Japanese',hist = True)
#plt.legend()
In [25]:
## Plot mpg for cars of American, European and Japanese make.
hp3 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            Geom.histogram(position=:identity, bincount=40),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity),

            Guide.title("Cars with all origins"))
Out[25]:
[Plot: overlaid histograms of mpg colored by s_origin (legend: 1, 3, 2), titled "Cars with all origins"]
In [26]:
hstack(hp1, hp2, hp3)
Out[26]:
[Plot: hp1, hp2 and hp3 side by side, titled "Cars of American Origin - 1", "Cars of American origin - 2" and "Cars with all origins"]
  • Plot the histogram along with density
In [27]:
#Histogram with density plot
hp4 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            Geom.density(), 
            Geom.histogram(position=:identity, density = true, bincount=40),
    
            Guide.title("Histogram and Density Plot-1"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity))
Out[27]:
[Plot: density curves over density-scaled histograms of mpg, colored by s_origin (legend: 1, 3, 2), titled "Histogram and Density Plot-1"]
In [28]:
#The other way to write hp4
hp5 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            layer(Geom.density(), Geom.histogram(position=:identity, density = true, bincount=40)),
    
            Guide.title("Histogram and Density Plot-1"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity))
Out[28]:
[Plot: density curves over density-scaled histograms of mpg, colored by s_origin (legend: 1, 3, 2), titled "Histogram and Density Plot-1"]
In [29]:
gridstack([hp4 hp5])
Out[29]:
[Plot: hp4 and hp5 side by side, both titled "Histogram and Density Plot-1"]
In [30]:
## Density plot using Geom.polygon 
hp6 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            Stat.density(),
            Geom.polygon(fill=true, preserve_order=true),
            Geom.histogram(position=:identity, density = true, bincount=40),
    
            Guide.title("Histogram and Density Plot-1"),
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity))
Out[30]:
[Plot: filled density polygons over density-scaled histograms of mpg, colored by s_origin (legend: 1, 3, 2), titled "Histogram and Density Plot-1"]
In [31]:
##The other way to write hp6
hp7 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            layer(Geom.density(), Geom.histogram(position=:identity, density = true, bincount=40)),
            
            ##Comment out the line above and uncomment the two below to see the change.
            #layer(Stat.density(), Geom.polygon(fill = true, preserve_order = true)),
            #layer(Geom.histogram(position=:identity, density = true, bincount=40)),
    
            Guide.title("Histogram and Density Plot-1"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity))
Out[31]:
[Plot: density curves over density-scaled histograms of mpg, colored by s_origin (legend: 1, 3, 2), titled "Histogram and Density Plot-1"]
In [32]:
gridstack([hp6 hp7])
Out[32]:
[Plot: hp6 and hp7 side by side, both titled "Histogram and Density Plot-1"]
In [33]:
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'],label = 'Japanese',hist=False)
#plt.legend()
In [34]:
#sn.distplot(autos_df['mpg'],hist = False)

Exercise

Plot the distribution of horsepower for cars with American origin and Japanese origin

In [ ]:

KDE Plot

A kernel density estimation (KDE) plot draws a smooth curve that follows the shape of the distribution. It is a nonparametric estimate of the density, in which inferences about the population are made from a finite data sample.

Parametric Data/Test: When the data is assumed to have been drawn from a particular distribution and some parametric test can be applied to it

Non-Parametric Data/Test: When we have no knowledge about the population and the underlying distribution

What is a Kernel?

Kernel: A kernel is a special type of probability density function (PDF) with the added property that it must be even. Thus, a kernel is a function with the following properties:

  • non-negative
  • real-valued
  • even
  • its definite integral over its support set must equal 1

Some common PDFs are kernels; they include the Uniform(-1,1) and standard normal distributions.
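
These properties can be checked numerically. The sketch below is only illustrative: gaussian, uniform, integrate, is_even and is_nonnegative are hypothetical helper names, the checks run on a finite grid, and the integral is approximated with a midpoint Riemann sum rather than computed exactly.

```julia
# Two candidate kernels: standard normal and Uniform(-1, 1)
gaussian(u) = exp(-u^2 / 2) / sqrt(2π)
uniform(u)  = abs(u) <= 1 ? 0.5 : 0.0

# Midpoint Riemann approximation of the integral of f over [a, b]
function integrate(f, a, b; n = 100_000)
    h = (b - a) / n
    return h * sum(f(a + (k - 0.5) * h) for k in 1:n)
end

grid = -5:0.25:5
is_even(f)        = all(f(u) ≈ f(-u) for u in grid)   # f(u) == f(-u)
is_nonnegative(f) = all(f(u) >= 0 for u in grid)

area_gaussian = integrate(gaussian, -10, 10)   # close to 1
area_uniform  = integrate(uniform, -1, 1)      # close to 1
```

The Gaussian has unbounded support, so the interval [-10, 10] is a practical stand-in: the mass outside it is negligible.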

What is Kernel density estimation?

Kernel density estimation is a non-parametric method of estimating the probability density function (PDF) of a continuous random variable. It is non-parametric because it does not assume any underlying distribution for the variable. Essentially, at every datum, a kernel function is created with the datum at its centre, which ensures that the kernel is symmetric about the datum. The PDF is then estimated by adding all of these kernel functions and dividing by the number of data points, so that the estimate satisfies the 2 properties of a PDF:

  • Every possible value of the PDF (i.e. the function, f(x)), is non-negative.
  • The definite integral of the PDF over its support set equals 1.

Steps in estimating kernel density:

  • Each observation is first replaced with a normal (Gaussian) curve centered at that value.
  • These curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1
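
The steps above can be sketched in a few lines of Base Julia. This is a hand-rolled illustration, not what Gadfly's Geom.density does internally: kde_at, mpg_sample and the bandwidth h = 2.0 are assumptions made for the example, and the sample values are made up.

```julia
# Standard normal kernel
gaussian(u) = exp(-u^2 / 2) / sqrt(2π)

# Density estimate at x: sum of Gaussian curves centred on each observation,
# divided by (number of observations * bandwidth) so the area is 1
kde_at(x, data, h) = sum(gaussian((x - xi) / h) for xi in data) / (length(data) * h)

mpg_sample = [14.0, 15.0, 18.0, 18.0, 22.0, 24.0, 27.0, 31.0]  # hypothetical values
h = 2.0                                                        # bandwidth picked by hand

grid = 0.0:0.1:50.0
dens = [kde_at(x, mpg_sample, h) for x in grid]

# Riemann-sum check that the estimated curve integrates to about 1
area = sum(dens) * step(grid)
```

A larger h gives a smoother, flatter curve and a smaller h gives a bumpier one, which is the main tuning choice in a KDE plot.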

More about KDE at:

  • Comparing mpg distributions of cars by different origins
In [35]:
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'],hist=False, label = 'American')
#sn.distplot(autos_df[autos_df.origin == 2]['mpg'], hist = False, label = 'European')
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'], hist = False, label = 'Japanese')
In [36]:
# Just the density plot
dp1 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            layer(Stat.density(), Geom.polygon(fill=true, preserve_order=true)),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.5], discrete_highlight_color=identity))
Out[36]:
[Figure: density curves of mpg (x-axis), colored by s_origin]
In [37]:
# Density plot with bars at the 5% and 95% quantiles (central 90% of the data).

dp2 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            #layer(Stat.density, Geom.polygon(fill=true, preserve_order=true), alpha=[0.4]),
            layer(Geom.density()),
            layer(Stat.quantile_bars(quantiles=[0.05, 0.95]), Geom.segment),
    
            Guide.title("1st Density plot with 90% CI"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.2], discrete_highlight_color=identity))
Out[37]:
[Figure: "1st Density plot with 90% CI"; x-axis: mpg, color: s_origin]
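Assuming that Stat.quantile_bars places its bars at the sample quantiles of x (0.05 and 0.95 in the cell above), the bar positions can be reproduced with quantile from the Statistics standard library. The range between the two bars then covers the central 90% of the observations, which is a descriptive range of the data rather than a confidence interval for a parameter. The vector below is made up for illustration.

```julia
using Statistics  # quantile

# Toy data (hypothetical): the values 1.0, 2.0, ..., 101.0.
x = collect(1.0:101.0)

# 5% and 95% sample quantiles (Julia's default interpolation rule).
lo, hi = quantile(x, [0.05, 0.95])

# Share of observations that fall between the two quantiles.
inside = count(v -> lo <= v <= hi, x) / length(x)

println("5% quantile = ", lo, ", 95% quantile = ", hi)
println("share of observations between them = ", round(inside; digits = 2))
```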
In [38]:
##The other way
dp3 = plot(autos_df, 
            x=:mpg,  
            color=:s_origin,
    
            layer(Stat.density, Geom.polygon(fill=true, preserve_order=true) ), #alpha=[0.4]
            layer(Stat.quantile_bars(quantiles=[0.05, 0.95]), Geom.segment),
    
            Guide.title("2nd Density plot with 90% CI"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
            Theme(alphas=[0.4], discrete_highlight_color=identity))
Out[38]:
[Figure: "2nd Density plot with 90% CI"; x-axis: mpg, color: s_origin]
In [39]:
#hstack(dp1,dp2,dp3)
gridstack([dp1 dp2 dp3])
Out[39]:
[Figure: the three density plots dp1, dp2 and dp3 in one row; x-axis: mpg, color: s_origin]

Bar plot

  • Plot average miles per gallon for different cylinder types using autos_df dataframe.

In the Seaborn package (Python), the barplot function has an estimator parameter, so it computes the average of a numeric feature for each category on its own.

In [40]:
#sn.barplot(y = 's_origin',
#           x = 'cylinders',
#           data = autos_df,
#          )

If it is just a count of values w.r.t a variable:

In [41]:
plot(autos_df,
    x = "cylinders",
    Geom.bar,
    Theme(bar_spacing = 4mm)
    )
Out[41]:
[Figure: bar chart; x-axis: cylinders]

With Gadfly, the average has to be computed separately before calling the plot function.

To plot average miles per gallon for different cylinder types using DataFrames and Gadfly:

  1. Use groupby to group by cylinders and calculate the mean of mpg. Name this dataframe mpg_cylinder_df.
  2. Call plot() from Gadfly with Geom.bar to plot mpg_cylinder_df.
In [42]:
mpg_cylinder_df = combine(groupby(autos_df,[:cylinders]), :mpg .=> [mean])
Out[42]:

5 rows × 2 columns

     cylinders  mpg_mean
     Int64      Float64
1    8          14.9631
2    4          29.2839
3    6          19.9735
4    3          20.55
5    5          27.3667
In [43]:
bp1 = plot(mpg_cylinder_df, 
            x=:cylinders, 
            y=:mpg_mean,    
    
            Geom.bar(), #position=:dodge or :stack (but not needed for now.)
            
    
            Guide.title("1st Bar plot"),
    
    
            Theme( 
                  bar_spacing=4mm, 
                  key_position=:right),
    
            Coord.cartesian(xmin=2, xmax=9))
Out[43]:
[Figure: "1st Bar plot"; x-axis: cylinders, y-axis: mpg_mean]
  • Draw the bar plot for average miles per gallon grouped by cylinders and origin using DataFrames and Gadfly
    1. Use groupby to group by cylinders and s_origin and calculate the mean of mpg. Name this dataframe mpg_cylinders_origin_df.
    2. Call plot() from Gadfly with Geom.bar to plot mpg_cylinders_origin_df.
In [44]:
#sn.barplot(x = 'cylinders',
#           y = 'mpg',
#           hue = 'origin',
#           data = autos_df)
In [45]:
mpg_cylinders_origin_df = combine(groupby(autos_df,[:cylinders,:s_origin]), :mpg .=> [mean])

mpg_cylinders_origin_df.label=string.(round.(Int, mpg_cylinders_origin_df.mpg_mean))

mpg_cylinders_origin_df
Out[45]:

9 rows × 4 columns

     cylinders  s_origin  mpg_mean  label
     Int64      String    Float64   String
1    8          1         14.9631   15
2    4          3         31.5957   32
3    6          1         19.6452   20
4    4          2         28.1066   28
5    4          1         28.013    28
6    3          3         20.55     21
7    6          3         23.8833   24
8    6          2         20.1      20
9    5          2         27.3667   27
In [46]:
bp2 = plot(mpg_cylinders_origin_df, 
            x=:cylinders, 
            y=:mpg_mean,    
            color=:s_origin,
    
            
            label=:label,Geom.label(position=:centered),  Stat.dodge(position=:stack),
            Geom.bar(position=:stack),
           
            
    
            Guide.title("1st Bar plot"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
    
            Theme( 
                  bar_spacing=4mm, 
                  key_position=:right),
    
            Coord.cartesian(xmin=2, xmax=9))
Out[46]:
[Figure: "1st Bar plot"; x-axis: cylinders, y-axis: mpg_mean, color: s_origin; bar labels: 15, 32, 20, 28, 28, 21, 24, 20, 27]

Exercise

Modify the code above so that the vertical bars are dodged instead of stacked.

In [ ]:

A horizontal dodged chart can be plotted as below:

In [47]:
bp3 = plot(mpg_cylinders_origin_df, 
            y=:cylinders, 
            x=:mpg_mean,    
            color=:s_origin,
    
            
            label=:label,Geom.label(position=:right),  Stat.dodge( axis=:y),
            Geom.bar(position=:dodge,orientation=:horizontal),
               
    
            Guide.title("2nd Bar plot"),
            Guide.yticks(orientation=:vertical), 
            Guide.ylabel("# Cylinders"),
            Guide.xlabel("Avg miles per gallon"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
    
            Theme( 
                  bar_spacing=4mm, 
                  key_position=:right),
    
            Coord.cartesian(ymin=2, ymax=9, yflip=true))
Out[47]:
[Figure: "2nd Bar plot"; x-axis: Avg miles per gallon, y-axis: # Cylinders, color: s_origin; bar labels: 15, 32, 20, 28, 28, 21, 24, 20, 27]
  • Draw a bar plot that shows the count of cars by origin and number of cylinders.
  1. Use groupby to group by cylinders and s_origin and count the number of cars. Name this dataframe origin_cylinders_df.
  2. Call plot() from Gadfly with Geom.bar to plot origin_cylinders_df.
In [48]:
origin_cylinders_df = combine(groupby(autos_df,[:s_origin,:cylinders]), nrow .=> [:count])
Out[48]:

9 rows × 3 columns

     s_origin  cylinders  count
     String    Int64      Int64
1    1         8          103
2    3         4          69
3    1         6          73
4    2         4          61
5    1         4          69
6    3         3          4
7    3         6          6
8    2         6          4
9    2         5          3
In [49]:
bp4 = plot(origin_cylinders_df, 
            x=:cylinders, 
            y=:count,    
            color=:s_origin,
    
            
            layer(
            label=string.(origin_cylinders_df.count),
            Geom.label(position=:centered), 
            Stat.dodge(position=:dodge)),
    
            Geom.bar(position=:dodge),
           
            
    
            Guide.title("Count of cars with cylinders and origin"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
    
            Theme( 
                  bar_spacing=4mm, 
                  key_position=:right),
    
            Coord.cartesian(xmin=2, xmax=9))
Out[49]:
[Figure: "Count of cars with cylinders and origin"; x-axis: cylinders, y-axis: count, color: s_origin; bar labels: 103, 69, 73, 61, 69, 4, 6, 4, 3]

Heatmap

A heatmap can be used to visualize a square matrix, such as a correlation matrix. To visualize rectangular data, Gadfly has Geom.rectbin:

http://gadflyjl.org/stable/gallery/geometries/#[Geom.rect](@ref),-[Geom.rectbin](@ref)

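We will proceed with visualizing a correlation matrix. Pearson's correlation measures the strength of the linear relationship between two variables. Computing the coefficient does not require normally distributed or zero-mean data, although the usual significance test for it assumes normality.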

  • Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation.
  • Correlations of -1 or +1 imply an exact linear relationship.
  • Positive correlations imply that as x increases, so does y.
  • Negative correlations imply that as x increases, y decreases.
  • Plot the correlation between 'mpg', 'displacement','horsepower','acceleration','weight'
In [50]:
autos_cor_df = autos_df[:,["mpg",
                        "displacement",
                        "horsepower",
                        "weight",
                        "acceleration"]]

first(autos_cor_df, 6)  # `head` is deprecated in DataFrames.jl; `first(df, n)` replaces it
Out[50]:

6 rows × 5 columns

     mpg      displacement  horsepower  weight   acceleration
     Float64  Float64       Float64     Float64  Float64
1    18.0     307.0         130.0       3504.0   12.0
2    15.0     350.0         165.0       3693.0   11.5
3    18.0     318.0         150.0       3436.0   11.0
4    16.0     304.0         150.0       3433.0   12.0
5    17.0     302.0         140.0       3449.0   10.5
6    15.0     429.0         198.0       4341.0   10.0

The code below gives the correlation matrix in Python using pandas:

In [51]:
#auto_cor_df.corr()
In [52]:
cor(autos_df[:,"mpg"],autos_df[:,"horsepower"])
Out[52]:
-0.7784267838977761
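As a cross-check on what cor returns, the Pearson coefficient can be written out from its definition. The sketch below uses made-up toy vectors rather than the cars data, and pearson is a helper name invented for this illustration.

```julia
using Statistics  # mean, cor

# Pearson correlation written out from its definition:
# covariance of x and y divided by the product of their spreads.
# `pearson` is a helper name made up for this sketch.
function pearson(x, y)
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

# Toy vectors (hypothetical, not the cars data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_noisy = [2.0, 1.0, 4.0, 3.0, 5.0]

println(pearson(x, 2 .* x .+ 1))      # exact increasing linear relationship
println(pearson(x, -3 .* x .+ 10))    # exact decreasing linear relationship
println(pearson(x, y_noisy) ≈ cor(x, y_noisy))   # agrees with Statistics.cor
```

The first two calls illustrate the bullets on exact linear relationships: any increasing straight line gives +1 and any decreasing one gives -1, regardless of slope or intercept.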
In [53]:
cor(Matrix(autos_cor_df))
Out[53]:
5×5 Array{Float64,2}:
  1.0       -0.805127  -0.778427  -0.832244   0.423329
 -0.805127   1.0        0.897257   0.932994  -0.5438
 -0.778427   0.897257   1.0        0.864538  -0.689196
 -0.832244   0.932994   0.864538   1.0       -0.416839
  0.423329  -0.5438    -0.689196  -0.416839   1.0
In [ ]:

In [54]:
cor_matrix = NamedArray( cor( Matrix(autos_cor_df) ) )
Out[54]:
5×5 Named Array{Float64,2}
A ╲ B │         1          2          3          4          5
──────┼──────────────────────────────────────────────────────
1     │       1.0  -0.805127  -0.778427  -0.832244   0.423329
2     │ -0.805127        1.0   0.897257   0.932994    -0.5438
3     │ -0.778427   0.897257        1.0   0.864538  -0.689196
4     │ -0.832244   0.932994   0.864538        1.0  -0.416839
5     │  0.423329    -0.5438  -0.689196  -0.416839        1.0
In [55]:
cor_matrix = NamedArray( cor( Matrix(autos_cor_df)), 
            ([names(autos_cor_df);],[names(autos_cor_df);]), 
            ("Rows", "Cols"))
Out[55]:
5×5 Named Array{Float64,2}
 Rows ╲ Cols │          mpg  displacement    horsepower        weight  acceleration
─────────────┼─────────────────────────────────────────────────────────────────────
mpg          │          1.0     -0.805127     -0.778427     -0.832244      0.423329
displacement │    -0.805127           1.0      0.897257      0.932994       -0.5438
horsepower   │    -0.778427      0.897257           1.0      0.864538     -0.689196
weight       │    -0.832244      0.932994      0.864538           1.0     -0.416839
acceleration │     0.423329       -0.5438     -0.689196     -0.416839           1.0

A developer's way of doing the above. The code block below is not mine.

Source: https://discourse.julialang.org/t/first-impression-of-dataframes-jl/49753/5

In [56]:
struct NoPrint end; Base.show(::IO, ::NoPrint) = nothing
NamedArray([i > j ? cor(autos_cor_df[!, i], autos_cor_df[!, j]) : 
        NoPrint() 
        for i in 2:ncol(autos_cor_df), 
            j in 1:ncol(autos_cor_df)-1],
                  (names(autos_cor_df)[2:end], names(autos_cor_df)[1:end-1]))
Out[56]:
4×4 Named Array{Any,2}
       A ╲ B │          mpg  displacement    horsepower        weight
─────────────┼───────────────────────────────────────────────────────
displacement │    -0.805127                                          
horsepower   │    -0.778427      0.897257                            
weight       │    -0.832244      0.932994      0.864538              
acceleration │     0.423329       -0.5438     -0.689196     -0.416839

End of the developer's code.

Proceeding with heatmap

Heatmap using seaborn in python.

In [57]:
#sn.heatmap(auto_clean_df.corr(),
#           annot = True,
#           cmap = sn.diverging_palette(250, 10, n = 25))

The basic heatmap using spy function from Gadfly

In [58]:
spy(cor_matrix)
Out[58]:
[Figure: heatmap of cor_matrix; axes: x and y (matrix indices), color scale from -1.0 to 1.0]
In [59]:
cor_names= [names(cor_matrix,1);]  ##could have written (cor_matrix,2) as well
Out[59]:
5-element Array{String,1}:
 "mpg"
 "displacement"
 "horsepower"
 "weight"
 "acceleration"
In [60]:
spy(cor_matrix, 
    Scale.y_discrete(labels = i->cor_names[i]), 
    Scale.x_discrete(labels = i->cor_names[i]),
    Scale.color_continuous(colormap=Scale.lab_gradient("red", "white", "green")))
Out[60]:
[Figure: heatmap of cor_matrix; both axes labeled mpg, displacement, horsepower, weight, acceleration; color scale from -1.0 to 1.0]

Scatter plot

A scatter plot is a cloud of points showing the joint distribution of two numerical variables, where each point represents an observation from the dataset. It helps to understand the relationship between two numerical variables.

  • Scatter plot mpg and weight of the cars
In [61]:
#sn.jointplot(x = 'mpg',
#             y = 'weight',
#            hue = 's_origin',
#             data = autos_df, 
#             kind = 'scatter'
#            )
In [62]:
Gadfly.push_theme(:dark) # switch to the dark theme (:default is restored later)
In [63]:
plot(autos_df, x="mpg", y="weight", color = "s_origin", Geom.point)
Out[63]:
[Figure: scatter plot; x-axis: mpg, y-axis: weight, color: s_origin]

The problem with scatter plots is overplotting. When the dataset is huge, the dots of the scatter plot tend to overlap, and the graphic becomes unreadable and meaningless.

Hexbin plot

In one dimension, intervals (straight line segments) are the only possible bin shape for a histogram. For data in two dimensions, bins can take more general shapes (rectangles or hexagons):

  • The obvious strategy is to choose rectangular bins to build a histogram. Imagine the above scatter plot being covered with rectangular boxes (where the boxes represent the bins in the horizontal and vertical directions). Each bin can be colored with a gradient fill according to the count of values in it.
  • Hexagon tiling uses a hexagon shape for binning. The same scatter plot can be covered with hexagons, and the count of points falling in each hexagon can be used to fill the shape.
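The rectangular strategy in the first bullet can be sketched in a few lines of plain Julia. The helper bin2d and the toy vectors are made up for this illustration, and the bin counts nx and ny are arbitrary; the hexagonal case is left to Geom.hexbin in the exercise.

```julia
# Count points per rectangular bin; `bin2d` is a helper name made up for this sketch.
# Assumes x and y each contain at least two distinct values.
function bin2d(x, y; nx = 4, ny = 3)
    xlo, xhi = extrema(x)
    ylo, yhi = extrema(y)
    counts = zeros(Int, ny, nx)
    for (xi, yi) in zip(x, y)
        # Map each coordinate to a bin index in 1:nx (or 1:ny); the maximum
        # value is clamped into the last bin.
        i = min(nx, 1 + floor(Int, nx * (xi - xlo) / (xhi - xlo)))
        j = min(ny, 1 + floor(Int, ny * (yi - ylo) / (yhi - ylo)))
        counts[j, i] += 1
    end
    return counts
end

# Toy points (hypothetical values loosely shaped like mpg vs. weight).
mpg_toy    = [15.0, 16.0, 18.0, 18.5, 24.0, 25.0, 26.0, 31.0, 32.0, 36.0]
weight_toy = [4300.0, 4100.0, 3500.0, 3450.0, 2800.0, 2700.0, 2650.0, 2100.0, 2000.0, 1800.0]

counts = bin2d(mpg_toy, weight_toy)
display(counts)   # rows are y bins, columns are x bins
```

Each entry of counts is the number of points in one rectangle; coloring the rectangles by these counts gives the two-dimensional histogram described above.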

Hex plot mpg and weight of the cars

Exercise

Use Geom.hexbin from Gadfly to draw the hexbin plot.

In [ ]:

This is how it is done in seaborn (Python):

In [64]:
#sn.jointplot(x = 'mpg',
#             y = 'weight',
#             data = autos_df, color = 'k',
#             kind = 'hex'
#            )

Boxplot

Box plot is a graphical representation of numerical data that can be used to understand the variability of the data and the existence of outliers.

A boxplot is a graph that gives you a good indication of how the values in the data are spread out.

To generate a box plot, assume the data are: 98, 77, 85, 88, 82, 83, 87, 67, 100, 63, 105

  • Arrange the data in ascending order: 63, 67, 77, 82, 83, 85, 87, 88, 98, 100, 105
  • Calculate the median (the middle value of the data, 85). This is Q2.
  • Calculate the median of the first half of the data (77). This is Q1.
  • Calculate the median of the second half of the data (98). This is Q3.
  • The box joins Q1 to Q3 (contains the middle 50% of the data).
  • IQR = Q3 - Q1 = 98 - 77 = 21
  • Lower inner fence: LIF = Q1 - 1.5 * IQR = 45.5; upper inner fence: UIF = Q3 + 1.5 * IQR = 129.5
  • The smallest observation greater than or equal to LIF ends the lower whisker (here 63).
  • The largest observation less than or equal to UIF ends the upper whisker (here 105).

Points outside the fences are outliers; this sample has none.
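The box plot quantities for the sample above can be computed with a short sketch that follows the median-of-halves rule used in the steps. The helper box_stats is a name made up for this illustration. Note that quantile from Statistics interpolates differently by default, so it can return slightly different quartiles for the same data.

```julia
using Statistics  # median

# Median-of-halves quartiles, fences, whiskers and outliers.
# `box_stats` is a helper name made up for this sketch.
function box_stats(data)
    s = sort(data)
    half = length(s) ÷ 2                  # the middle value is left out when n is odd
    q1 = median(s[1:half])                # median of the lower half
    q2 = median(s)                        # overall median
    q3 = median(s[end-half+1:end])        # median of the upper half
    iqr = q3 - q1
    lif = q1 - 1.5 * iqr                  # lower inner fence
    uif = q3 + 1.5 * iqr                  # upper inner fence
    lower_whisker = minimum(v for v in s if v >= lif)
    upper_whisker = maximum(v for v in s if v <= uif)
    outliers = [v for v in s if v < lif || v > uif]
    return (q1 = q1, q2 = q2, q3 = q3, iqr = iqr, lif = lif, uif = uif,
            lower_whisker = lower_whisker, upper_whisker = upper_whisker,
            outliers = outliers)
end

marks = [98, 77, 85, 88, 82, 83, 87, 67, 100, 63, 105]
stats = box_stats(marks)
println(stats)
```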

Interpret a boxplot:

  • If the box is wide and the whiskers are long, then maybe the data do not cluster.
  • If the box is small and the whiskers are short, then the data probably do cluster.
  • If the box is small and the whiskers are long, then maybe the data cluster but have some "outliers".

  • Plot the outliers in miles per gallon w.r.t. cylinders and origin.
In [65]:
Gadfly.push_theme(:default)
In [66]:
bp1 = plot(autos_df, x=:cylinders, y=:mpg, color=:s_origin,
    Geom.boxplot, Theme(boxplot_spacing=0.1mm),
    
    Guide.title("Boxplot of mpg with #cylinders"),
    Scale.color_discrete_manual("skyblue","red","green")
)
Out[66]:
[Figure: Boxplot of mpg with #cylinders. x-axis: cylinders; y-axis: mpg; color: s_origin (1, 3, 2)]

Concept of subgrid

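What if we want to bring more than 4 dimensions of the data into the same plot? Geom.subplot_grid splits the plot into a grid of panels, one panel per level of a grouping variable (xgroup or ygroup).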

In [67]:
set_default_plot_size(24cm, 20cm)
In [68]:
subgrid1_df = combine(groupby(autos_df,[:s_origin,:cylinders]), :mpg .=> [mean])
Out[68]:

9 rows × 3 columns

 Row  s_origin  cylinders  mpg_mean
      String    Int64      Float64
   1  1         8          14.9631
   2  3         4          31.5957
   3  1         6          19.6452
   4  2         4          28.1066
   5  1         4          28.013
   6  3         3          20.55
   7  3         6          23.8833
   8  2         6          20.1
   9  2         5          27.3667
In [69]:
gp1 = plot(subgrid1_df, 
            x=:s_origin, 
            y=:mpg_mean,    
            #xgroup=:cylinders, ##it can be any other categorical variable.
            ygroup=:cylinders,
            color=:s_origin, ## This can be some other categorical variable. Putting origin does not give any info.
    
            
            Geom.subplot_grid(
            layer(
            Geom.bar(position=:dodge)
            ),
        
            #To label the bars.
            layer(
            label=string.(round.(Int,subgrid1_df.mpg_mean)),
            Geom.label(position=:above)#, 
            #Stat.dodge(position=:dodge)
            )),
            
    
            Guide.title("Avg miles per gallon with cylinders and origin"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
    
            Theme( 
                  bar_spacing=4mm, 
                  key_position=:right)
)
Out[69]:
[Figure: Avg miles per gallon with cylinders and origin. Bar chart with rounded mpg_mean labels; x-axis: s_origin; y-axis: mpg_mean by cylinders (one panel each for 3, 4, 5, 6 and 8 cylinders); color: s_origin (1, 3, 2)]
In [70]:
subgrid2_df = combine(groupby(autos_df,[:model_year,:s_origin,:cylinders]), :mpg .=> [mean])

last(subgrid2_df, 6)  # show the last 6 rows (tail is not available in recent DataFrames.jl versions)
Out[70]:

6 rows × 4 columns

 Row  model_year  s_origin  cylinders  mpg_mean
      Int64       String    Int64      Float64
   1  81          3         6          24.8
   2  81          1         8          26.6
   3  82          1         4          30.0625
   4  82          2         4          40.0
   5  82          3         4          34.8889
   6  82          1         6          28.3333
In [71]:
gp2 = plot(subgrid2_df, 
            x=:model_year, 
            y=:mpg_mean,    
            #xgroup=:model_year, ##it can be any other categorical variable.
            ygroup=:cylinders,
            color=:s_origin, ## This can be some other categorical variable.
    
            
            Geom.subplot_grid(
            layer(
            Geom.line()
            ),
        
            layer(
            label=string.(round.(Int,subgrid2_df.mpg_mean)),
            Geom.label(position=:dynamic, hide_overlaps=true)
            )),
            
            
    
            Guide.title("Avg miles per gallon with cylinders and origin"),
    
            Scale.color_discrete_manual("skyblue","red","green"),
    
    
            Scale.x_continuous(;minvalue=65, maxvalue=80),
    
            Theme(panel_stroke=colorant"black",
                  grid_line_width=0mm) ## we can use style() or theme to pass these parameters.
)
Out[71]:
[Figure: Avg miles per gallon with cylinders and origin. Line chart with rounded mpg_mean labels; x-axis: model_year; y-axis: mpg_mean by cylinders (one panel each for 3, 4, 5, 6 and 8 cylinders); color: s_origin (1, 3, 2)]

Exercise

Draw a boxplot using Geom.subplot_grid to split the boxplot with ygroup = s_origin. Refer to the Geom.subplot_grid call used for the bar plot in the section above.

In [ ]:

Thank you